Chinese Word Frequency Approximation Based on Multitype Corpora

Authors

  • Wei Qiao
  • Maosong Sun
  • Wolfgang Menzel
Abstract

Due to the nature of Chinese, a perfect word-segmented Chinese corpus that is ideal for the task of word frequency estimation may never exist. Therefore, a reliable estimation for Chinese word frequencies remains a challenge. Currently, three types of corpora can be considered for this purpose: raw corpora, automatically word-segmented corpora, and manually word-segmented corpora. As each type has its own advantages and drawbacks, none of them is sufficient alone. In this article, we propose a hybrid scheme which utilizes existing corpora of different types for word frequency approximation. Experiments have been performed from statistical and application-oriented perspectives. We demonstrate that, compared with other schemes, the proposed scheme is the most effective one and leads to better word frequency approximation results.

Related Resources

Word Frequency Approximation for Chinese Using Raw, MM-Segmented and Manually Segmented Corpora

Word frequencies play important roles in many NLP-related applications. Word frequency estimation for Chinese remains a big challenge due to the characteristics of Chinese. An underlying fact is that a perfect word-segmented Chinese corpus never exists, and currently we only have raw corpora, which can be of arbitrarily large size, automatically word-segmented corpora derived from raw corpora, ...

On the Applicability of Zipf's Law in Chinese Word Frequency Distribution

Zipf's Law uncovers the relationship between word frequency and its rank. This paper addresses applicability of Zipf's Law in Chinese word frequency distribution. The previous studies on Zipf’s law in Chinese were primarily based on raw corpus, without word segmentation, hence there are obvious limitations. This study investigates the topic in several large-scale POS-tagged Chinese corpora. The...

Corpus-Based Adaptation Mechanisms for Chinese Homophone Disambiguation

Based on the concepts of bidirectional conversion and automatic evaluation, we propose two user-adaptation mechanisms, character-preference learning and pseudo-word learning, for resolving Chinese homophone ambiguities in syllable-to-character conversion. The 1991 United Daily corpus of approximately 10 million Chinese characters is used for extraction of 10 reporter-specific article databases a...

Word Class Discovery For Postprocessing Chinese Handwriting Recognition

This article presents a novel Chinese class n-gram model for contextual postprocessing of handwriting recognition results. The word classes in the model are automatically discovered by a corpus-based simulated annealing procedure. Three other language models, least-word, word-frequency, and the powerful interword character bigram model, have been constructed for comparison. Extensive experimen...

Vocabulary Lists for EAP and Conversation Students

Despite the abundance of research investigating general and academic vocabularies and developing dozens of word lists, few studies have compared academic vocabulary with general service word lists such as conversation vocabulary. Many EAP researchers assume that university students need to know all the words in West’s (1953) General Service List (GSL) as a prerequisite to academic words (e.g., ...

Journal title:
  • Journal of Quantitative Linguistics

Volume 17

Publication date: 2010